Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
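To make the idea concrete, the following is a minimal sketch of a character composition encoder of the kind the abstract describes: characters are embedded, read by a bidirectional recurrent layer, and combined into a fixed-size tweet vector that feeds a hashtag classifier. The Bi-GRU choice, all layer names, and all dimensions are illustrative assumptions, not the paper's specification.

```python
# Sketch of a character-level tweet encoder (PyTorch).
# ASSUMPTIONS: the Bi-GRU architecture, dimensions, and names below are
# illustrative; the abstract only states that a character composition
# model maps whole tweets to vectors and predicts hashtags.
import torch
import torch.nn as nn

class CharTweetEncoder(nn.Module):
    def __init__(self, n_chars, char_dim=64, hidden_dim=256, n_hashtags=1000):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)           # character embeddings
        self.gru = nn.GRU(char_dim, hidden_dim, batch_first=True,
                          bidirectional=True)                  # reads the char sequence both ways
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim)      # combine directions -> tweet vector
        self.classifier = nn.Linear(hidden_dim, n_hashtags)    # hashtag prediction head

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer-encoded characters
        x = self.embed(char_ids)
        _, h = self.gru(x)                   # h: (2, batch, hidden_dim) final states
        h = torch.cat([h[0], h[1]], dim=-1)  # concatenate forward/backward final states
        tweet_vec = self.proj(h)             # fixed-size vector-space representation
        return self.classifier(tweet_vec)    # logits over hashtags

# Usage: encode a batch of two 40-character tweets (random ids for illustration).
model = CharTweetEncoder(n_chars=128)
batch = torch.randint(0, 128, (2, 40))
logits = model(batch)  # shape (2, 1000)
```

Because the model operates on characters rather than words, misspellings, abbreviations, and special characters never fall outside the vocabulary, which is what lets it handle inputs where word-level baselines hit out-of-vocabulary tokens.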